How to not be afraid of your data

Ted Laderas

7/9/2017

Needed R Packages

You will need R and RStudio to run this workshop.

You will also need the following packages. Install them once with install.packages(), then load them with library():

library(tidyverse)
library(shiny)

Introduction

Cribbed from my twitter bio:

Overview

Who’s Afraid of Data?

What’s the Problem?

Datasaurus Dozen
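The Datasaurus Dozen is a collection of datasets whose summary statistics are nearly identical while their scatterplots look nothing alike. The block below is a minimal sketch of the same idea using Anscombe's quartet, which ships with base R as anscombe. It is a stand-in chosen because it needs no extra package; it is not the dataset shown on the slide.

```r
# Anscombe's quartet: four x/y pairs stored as columns x1..x4 and y1..y4
data(anscombe)

x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# The summary statistics are almost the same for all four pairs
x_means <- sapply(anscombe[, x_cols], mean)
y_means <- sapply(anscombe[, y_cols], mean)
xy_cors <- mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]),
                  x_cols, y_cols)

print(round(rbind(x_means, y_means, xy_cors), 2))

# Plotting each pair shows how different the four datasets really are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[x_cols[i]]], anscombe[[y_cols[i]]],
       xlab = x_cols[i], ylab = y_cols[i])
}
par(op)
```

The numbers alone would suggest four copies of the same dataset; only the plots reveal the difference.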

The Approach

gRadual exposuRe can lessen fear…

Hadley Wickham’s Data Wrangling Diagram

What is Exploratory Data Analysis (EDA)?

Remember

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey, Exploratory Data Analysis

EDA is not cheating!

In contrast to Confirmatory Data Analysis (CDA), such as hypothesis testing, the goals of EDA are to:

EDA is vital when repurposing and reusing data that was collected for another purpose. Don’t go in blind!

EDA: Always read the data dictionary!

Always read the data dictionary if provided! There is often some useful information in there about how the data is represented (such as the units for each column, etc).

Data Dictionary Example

Some of the EDA tools built into R
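The block below is a minimal sketch of a first look at a data frame using only functions that ship with base R, with iris standing in for your own data. The choice of functions is an assumption about what is useful at this stage, not a complete list.

```r
data(iris)

# Structure: column types and the first few values of each column
str(iris)

# Ranges and quartiles for numeric columns, counts for factors
print(summary(iris))

# Size of the data, missing values per column, and group sizes
dims <- dim(iris)
missing_per_col <- colSums(is.na(iris))
species_counts <- table(iris$Species)

print(dims)
print(missing_per_col)
print(species_counts)

# Quick plots: one distribution, then the same variable split by group
hist(iris$Sepal.Length, main = "Sepal.Length", xlab = "Sepal.Length")
boxplot(Sepal.Length ~ Species, data = iris)
```

Comparing the output of str() and summary() against the data dictionary is a quick way to spot columns whose type or range is not what the dictionary promises.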

Wrap them up in a Shiny wrapper

Dashboard Image

Provide the tools to clean up the data

Three commands:

dplyr::filter()

filter() lets you select the rows that meet a criterion. You can combine logical conditions with | (OR) and & (AND).

library(dplyr)
newIris <- iris %>% filter(Species == "setosa" & Sepal.Length > 5)

head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          5.4         3.9          1.7         0.4  setosa
## 3          5.4         3.7          1.5         0.2  setosa
## 4          5.8         4.0          1.2         0.2  setosa
## 5          5.7         4.4          1.5         0.4  setosa
## 6          5.4         3.9          1.3         0.4  setosa

Note that any expression or function that produces a logical vector with one TRUE/FALSE value per row (such as is.na(Species)) can be used here.
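A few more filter() patterns, sketched on iris. The object names (either, both, complete) are only illustrative.

```r
library(dplyr)

# | (OR): keep rows that are setosa, or have a long sepal, or both
either <- iris %>% filter(Species == "setosa" | Sepal.Length > 7)

# Conditions separated by commas are combined with & (AND)
both <- iris %>% filter(Species == "setosa", Sepal.Length > 5)

# A function that returns one TRUE/FALSE per row works as a condition.
# iris has no missing values, so this keeps every row.
complete <- iris %>% filter(!is.na(Sepal.Length))

print(nrow(either))
print(nrow(both))
print(nrow(complete))
```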

dplyr::select()

select() lets you select columns in your dataset.

Remember: “filter() works on rows, select() works on columns.” - Chester’s Mantra

library(dplyr)
newIris <- iris %>% select(Sepal.Width, Species)

head(newIris)
##   Sepal.Width Species
## 1         3.5  setosa
## 2         3.0  setosa
## 3         3.2  setosa
## 4         3.1  setosa
## 5         3.6  setosa
## 6         3.9  setosa
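select() also accepts helper functions and negative selections. The sketch below shows three common forms on iris; the object names are illustrative.

```r
library(dplyr)

# Keep every column whose name starts with "Sepal"
sepals <- iris %>% select(starts_with("Sepal"))

# Drop a column by putting a minus sign in front of it
no_species <- iris %>% select(-Species)

# Rename a column while selecting it (new_name = old_name)
renamed <- iris %>% select(species = Species, Sepal.Width)

print(names(sepals))
print(names(no_species))
print(names(renamed))
```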

dplyr::mutate()

mutate() is one of the most useful dplyr commands. Use it to compute a new variable from existing columns and add it to the data.frame as a new column:

library(dplyr)
newIris <- iris %>% mutate(sepalSum = Sepal.Length + Sepal.Width)

head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepalSum
## 1          5.1         3.5          1.4         0.2  setosa      8.6
## 2          4.9         3.0          1.4         0.2  setosa      7.9
## 3          4.7         3.2          1.3         0.2  setosa      7.9
## 4          4.6         3.1          1.5         0.2  setosa      7.7
## 5          5.0         3.6          1.4         0.2  setosa      8.6
## 6          5.4         3.9          1.7         0.4  setosa      9.3
# Add a column with the same value for every row
newIris <- newIris %>% mutate(value = "Site1")

head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepalSum value
## 1          5.1         3.5          1.4         0.2  setosa      8.6 Site1
## 2          4.9         3.0          1.4         0.2  setosa      7.9 Site1
## 3          4.7         3.2          1.3         0.2  setosa      7.9 Site1
## 4          4.6         3.1          1.5         0.2  setosa      7.7 Site1
## 5          5.0         3.6          1.4         0.2  setosa      8.6 Site1
## 6          5.4         3.9          1.7         0.4  setosa      9.3 Site1
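A single mutate() call can create several columns, and a later column can use one created earlier in the same call. The sketch below assumes illustrative column names (sepalSum, sepalMean, sepalSize) that are not part of the workshop data.

```r
library(dplyr)

newIris <- iris %>%
  mutate(
    sepalSum  = Sepal.Length + Sepal.Width,
    # sepalSum was created on the line above and can already be used here
    sepalMean = sepalSum / 2,
    # ifelse() returns one value per row, so it works inside mutate()
    sepalSize = ifelse(Sepal.Length > 5, "long", "short")
  )

print(head(newIris))
```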

Chaining dplyr commands using %>%

The power of dplyr comes from chaining multiple steps: the %>% pipe passes the result on its left as the first argument to the function on its right.

Example: Let’s calculate a new column SepalMean on iris and filter the dataset on this new variable.

library(dplyr)
data(iris)

iris2 <- iris %>% mutate(SepalMean = (Sepal.Length + Sepal.Width) / 2) %>%
  filter(SepalMean > 4)

nrow(iris)
## [1] 150
nrow(iris2)
## [1] 123
head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SepalMean
## 1          5.1         3.5          1.4         0.2  setosa      4.30
## 2          5.0         3.6          1.4         0.2  setosa      4.30
## 3          5.4         3.9          1.7         0.4  setosa      4.65
## 4          5.0         3.4          1.5         0.2  setosa      4.20
## 5          5.4         3.7          1.5         0.2  setosa      4.55
## 6          4.8         3.4          1.6         0.2  setosa      4.10
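To see what the pipe is doing, the same two steps can be written as nested function calls. This sketch checks that both forms give the same result; piped and nested are illustrative names.

```r
library(dplyr)

# Piped: read top to bottom, one step per line
piped <- iris %>%
  mutate(SepalMean = (Sepal.Length + Sepal.Width) / 2) %>%
  filter(SepalMean > 4)

# Nested: the same steps, read from the inside out
nested <- filter(
  mutate(iris, SepalMean = (Sepal.Length + Sepal.Width) / 2),
  SepalMean > 4
)

print(all.equal(piped, nested))
```

The piped version becomes easier to read than the nested one as more steps are added.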

Using the EDA Shiny App

For the workshop, you’ll use the Shiny App to do some EDA. You will need to be familiar with the basic architecture of the app.

Shiny Architecture
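Every Shiny app has two parts: a ui object that lays out inputs and outputs, and a server function that fills the outputs in response to the inputs. The block below is a minimal sketch of that structure using iris. It is not the code of the workshop app, and the input and output IDs ("var", "hist") are made up for illustration.

```r
library(shiny)

# ui: what the user sees (one dropdown and one plot)
ui <- fluidPage(
  selectInput("var", "Variable", choices = names(iris)[1:4]),
  plotOutput("hist")
)

# server: how outputs are computed from inputs.
# renderPlot() reruns whenever input$var changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(iris[[input$var]], main = input$var, xlab = input$var)
  })
}

# Combine both parts into an app object (this does not start the app)
app <- shinyApp(ui = ui, server = server)
```

In an interactive session, runApp(app) starts the app in a browser.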

The Problem

An experimental weight loss drug was first tested at one site with volunteers (DatasetA). Given the small sample size, volunteers from an additional site were recruited (DatasetB). Read the data dictionaries!

  1. Your goal is to conduct EDA on the two separate datasets to assess whether there was an effect from the weight loss drug. If you notice any issues in the data, will you need to filter or transform the data?
  2. Given your EDA of both datasets, can you combine the datasets into a single dataset for analysis? What will you need to do to compare them?
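For the second question, the mechanics of combining two data frames might look like the sketch below. Everything in it is hypothetical: the data frames siteA and siteB, their column names, and the unit mismatch are invented to show the technique and say nothing about what DatasetA and DatasetB actually contain. Check the data dictionaries for the real columns and units.

```r
library(dplyr)

# Made-up stand-ins for two sites (NOT the workshop data)
siteA <- data.frame(id = 1:3, weight_kg = c(80, 92, 75))
siteB <- data.frame(id = 1:2, weight_lb = c(180, 200))

# Put both datasets on the same scale and the same column names
siteB_kg <- siteB %>%
  mutate(weight_kg = weight_lb * 0.45359237) %>%
  select(id, weight_kg)

# Stack the rows and record which site each row came from
combined <- bind_rows(list(Site1 = siteA, Site2 = siteB_kg), .id = "site")

print(combined)
```

Keeping a site column makes it possible to check later whether the two sites still differ after cleaning.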

Go as far as you can. Remember to use your post-its to show your status and whether you need help.

Go For It!

EDA is a puzzle with real-world consequences. Use your tools to understand the data!

Get the workshop materials here:

git clone http://github.com/laderast/shinyEDA

In the shinyEDA/ folder, open the .Rproj file.

Datasets are in the data/ folder along with the data dictionaries and readmes.

Read weightLossAssignment.pdf for more details.

Discussion and Walkthrough

For each issue with the data:

Load your own data

You can load your own datasets for exploration and cleaning: just assign your data to an object named dataset.

The dashboard tries to detect which variables are numeric (numeric) and which are categorical (ordered, factor), so you may need to set the type of each variable yourself.
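A minimal sketch of setting variable types before exploring, using the built-in mtcars data as a stand-in for your own file. Which columns should be categorical is an assumption made for illustration; how the app picks up dataset is described in the workshop materials.

```r
library(dplyr)

# mtcars stores everything as numeric, including variables that are
# really categories (number of cylinders, number of gears)
dataset <- mtcars %>%
  mutate(
    cyl  = factor(cyl),                  # unordered categories
    gear = factor(gear, ordered = TRUE)  # ordered categories
  )

# Check the first class of each column after conversion
print(sapply(dataset, function(col) class(col)[1]))
```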

More uses of the EDA Shiny App

Cardiovascular Risk Prediction Workshop:

https://github.com/laderast/cvdNight1

Many more workshops in the future!

Acknowledgements

This work was funded by a Big Data to Knowledge (BD2K) R25 Grant: 1R25EB020379-01

Suggestions? Comments?

Happy to Talk about the Shiny App

Feel free to fork the Shiny app for your own purposes. It’s designed to be a simple introduction to Shiny as well.